Orchestration vs. Choreography: The good, the bad, and the trade-offs

00:00:05 Laila Bougria

Good morning. How are y'all doing? Did you get some good coffee? You're going to need it. Okay. You already see the slides up on the screen? Yeah. Cool. All right. My name's Laila. I'm a software engineer and solution architect. I work for a company called Particular Software, where we build NServiceBus. We also have a booth. If you have any questions about messaging or distributed systems, both my colleagues and I would be happy to answer them. Now, a couple of years ago, my husband and I were living in this tiny, beautiful little house and I really loved it, because it was super easy to clean, so I was a happy person. But as we started to grow our family, we kind of started to outgrow it, and you could see this by the amount of spring cleanings that I started to do, but the amount of stuff that I was able to get rid of got smaller and smaller.

So, at some point we had to admit that we really needed to find a bigger house, and we had this dream of getting a big house, big garden where the kids could play outside and my husband would cook barbecues, and I would make my summer paella. Totally forgetting that in Belgium, you don't get a summer at all, right? But anyway, we still went forward with our dream house and at some point we found exactly what we were looking for, and that's when we started to talk to banks to find the best possible rate, right?

Because one of the things that my parents instilled in me is that you don't just go to the first bank and then get your mortgage. No. You compare with everything in life. So that's what we did. We basically went to, I don't know, six, seven banks to see which one would give us the best rate, because even though it's just a decimal, considering how long we're paying off these things, that can really add up, right? Now, another option to save yourself all of that hassle is to basically go to a loan broker, which basically does exactly that.

It'll go to many different banks and get you the best rate possible. So I started to think about how could we design a system like that? So if you start to think about it, a customer comes to us and makes a loan request. They are saying, "Oh, I want €500,000, I want to be able to pay this back in 20 years, and I would like a fixed rate, so that I'm not susceptible to market changes," for example.

Now the first thing a loan broker is going to do is make you pay for that effort. Other than that, we're not going to do anything for you, but once you've made a payment for that service, then we will basically perform some credit scoring. Now, this is because we want to figure out whether you're even a solvent customer to give a loan to begin with, but also it'll help us when getting the eligible banks that would actually make a good fit for you.

Because the thing is that some of the banks only work with entrepreneurs or some of them only work with premium customers. So we want to make a pre-selection, because calling those banks costs us money as a loan broker. Then based on those banks that we found, we will send all of them requests to get us a quote. Now, some of them will come back to us with a response and some of them won't. Now of course, all of those banks have their own contracts and their own APIs, so we'll still have a translation effort to do there, to basically convert that back into a format that we can understand within our system boundaries, and only then can we start to compare these things and provide the best possible offer to our customer. Now, looking at these requirements, I thought to myself, "Well, there's a lot of complexity involved here, so wouldn't it be great to leverage a microservice architecture?"

Now, of course, I want to do this, because building microservices or smaller units, they're comprehensible. It allows you to make the system a lot more resilient. You can scale a lot more granularly. They can behave in an autonomous way. They're decoupled, they're maintainable, and all of the things, did I forget any of the magic words? Now, the thing is that when you think about microservices, it's like small units, like these little circles on the screen. They're really small, really easy to comprehend and easy to maintain. So you could therefore argue that building these types of systems makes your code a lot easier, a lot simpler and a lot less complex, but there's really a dark side to this type of architecture, because the complexity hasn't just disappeared. No, it has actually moved, because even though each and every service is now small and comprehensible, and simple, the complexity is now in how those services communicate, in that inter-service communication, those arrows in between those services.

We've basically taken the complexity and moved it towards that, because if you think about a microservice, well, it's simple and easy, but it doesn't really provide any meaningful sort of customer value on its own. It's only when we start to put them together and have them all communicate that we get true customer value. Now, if we all agree that the complexity has moved to the lines, then we also need to pay a lot more attention when we design these lines. So what would happen if we then have more complex workflows, where we basically have multiple services that are participating in those workflows that we now need to coordinate a little bit? How do we draw those lines in a way that we can manage the complexity and stay away from that pitfall of building a distributed ball of mud? Well, as you might expect if you came to this session, you have two options to choose from.

There's orchestration and choreography. Now, orchestration brings you a sort of command-driven approach, where there's a central component that is basically managing the entire workflow and also keeping around all of the relevant state that pertains to that workflow to be able to make decisions, which is also one of the reasons that we refer to it as a state machine. Now that orchestrator based on that state will make decisions on what needs to happen and when that needs to happen by basically sending out commands to the relevant services, and then instruct them what to do, although it's not necessarily telling them how to do it, okay? In a choreographed approach, on the other hand, we have a more event-driven approach, right? Where in this case there is no central component at all that is directing, managing, or owning that workflow. Instead, what happens here is that the state becomes scattered across all of those services that are participating in that workflow.

Each of the services is basically keeping around the state that is applicable to that own service's internal boundaries. So if you want to understand what is happening, you have to also figure out which services are involved in that workflow. The way that this works is that individual services are going to subscribe to events that are meaningful to them and basically then decide what to do in the context of its own service boundary. Now, this is about the only piece of theory you're going to see this entire session, because I think it's a lot more valuable to go through actual domain exercises, which brings me back to our loan broker example. Now, let's assume for starters that we're going to attempt to design this using an orchestrated approach. Now, in this case, our user will initiate the request and that will start up our orchestrator.

Now at that point, the orchestrator says, "Okay, first I want to store some information. Then I will instruct payments to get us that money. From there on, we're going to do the credit scoring, and based on that information, we'll get those eligible banks we talked about. We'll send out all of those quote requests, and when we have all of the relevant ones, then we will ask our ranking service to basically rank them for us, and give us the best possible option."
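As a rough illustration of the flow just described, here is a minimal Python sketch of an orchestrator that owns the workflow state and tells each service what to do, but not how. The class names, the callables passed in, the score of 720 and the bank names are all hypothetical stand-ins, not part of the talk or of any real library.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LoanWorkflowState:
    """State of one loan request, kept centrally by the orchestrator."""
    request_id: str
    amount: int
    completed_steps: List[str] = field(default_factory=list)
    credit_score: Optional[int] = None
    quotes: Dict[str, float] = field(default_factory=dict)
    best_bank: Optional[str] = None


class LoanBrokerOrchestrator:
    """Decides what happens and when; the services decide how."""

    def __init__(self, take_payment, score_credit, eligible_banks, request_quote, rank):
        self.take_payment = take_payment
        self.score_credit = score_credit
        self.eligible_banks = eligible_banks
        self.request_quote = request_quote
        self.rank = rank

    def handle(self, request_id: str, amount: int) -> LoanWorkflowState:
        state = LoanWorkflowState(request_id, amount)
        self.take_payment(request_id)
        state.completed_steps.append("payment")
        state.credit_score = self.score_credit(request_id)
        state.completed_steps.append("credit_scoring")
        for bank in self.eligible_banks(state.credit_score):
            quote = self.request_quote(bank, amount)
            if quote is not None:  # some banks never come back with a quote
                state.quotes[bank] = quote
        state.completed_steps.append("quoting")
        state.best_bank = self.rank(state.quotes)
        state.completed_steps.append("ranking")
        return state


# Hypothetical stand-ins for the real downstream services.
orchestrator = LoanBrokerOrchestrator(
    take_payment=lambda request_id: None,
    score_credit=lambda request_id: 720,
    eligible_banks=lambda score: ["bank_a", "bank_b", "bank_c"] if score >= 700 else ["bank_a"],
    request_quote=lambda bank, amount: {"bank_a": 3.4, "bank_b": 3.1}.get(bank),
    rank=lambda quotes: min(quotes, key=quotes.get),  # lowest rate wins
)
state = orchestrator.handle("req-1", 500_000)
```

Because the state lives in one object, it can be inspected or queried in one place, which is the property the orchestrated approach is valued for.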

Now, this is pretty simple to understand and it looks great at first sight, because it's easy to understand how things are happening. It's easy to understand the flow. The state of this workflow is also stored within that orchestration component, and it's centralized. That way, we can also sort of investigate and even query that state to understand what the state of the workflow is. But as you might expect, there are also downsides. First of all, this requires all of these downstream services to always be available, but what would happen if one of them is down?

Because now the orchestrator has to basically introduce resilience patterns to be able to deal with those types of failures, like retries or back-off mechanisms or things like that, and what would happen if the service is actually down for a prolonged amount of time. Now, it also needs to understand how to deal with those scenarios. Another thing that stands out here is that there's a high amount of coupling in between the orchestration component and all of those individual services that are participating in this workflow. And the thing is that this can get tricky, because sometimes it can become very convenient to add additional steps here, and then the amount of coupling is basically exploding. So this approach can also lend itself to become its own little monolith, which is something that you have to be aware of. You have to manage the amount of coupling within a single orchestrator.
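To make the resilience concern concrete, here is a small sketch of a retry with exponential back-off, one of the patterns mentioned above. The exception type, the default attempt count and the delays are assumptions chosen for illustration; a prolonged outage would still need separate handling once the retries are exhausted.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical error raised when a downstream service cannot be reached."""


def call_with_retries(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(), retrying on ServiceUnavailable with exponential back-off.

    Waits base_delay, then 2x, then 4x, and so on between attempts.
    After max_attempts failures the last error is re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ServiceUnavailable:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable only so the waiting can be replaced in tests.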

Now another thing that stands out is all of that communication between those services has to basically go through the orchestrator. That's the single point of truth. So all of the information has to flow through it, so it can keep all of the relevant state and make decisions on what to do next, but that can also lead to high contention on the orchestrator, putting a lot of stress on that component, at which point we could say, "Okay, let's scale this thing out." But that is also tricky, because when you have a bunch of those workflow instances running, you usually want to route them always to the same instance of that orchestrator. So now you need to think about load balancers and stuff like that, right? And it's basically concerns like these that tend to lead to the idea that the orchestrator is now a single point of failure, because if the orchestrator is not available, the workflow can't progress in any shape or form, because that's your main dependency.

And if you consider what we need to do, if we want to add a step, let's say, okay, our customer paid for this loan broker service. Well, we also want to provide them with an invoice, so we create a dedicated service for that, because there's also sufficient complexity in there, but it's not sufficient to then create that service. We also need to go back to the orchestrator and make changes there, so that it will call the invoicing service when applicable. So now to add a step, you are changing two components. Wow, that's a lot of additional friction just to solve this coordination problem in a more distributed system. You know what? Let's just try choreography instead. Maybe that makes our life a little bit easier. Well, the flow is a little bit different here, because we don't have command-driven communication anymore, but we have event-based communication, so we will be publishing events and other components will be subscribing to those.

So the quoting service publishes an event that a quote was requested. First, our payments will happen, and from there on, credit scoring will subscribe to that. Quoting will then take care of all of getting the eligible banks and the quotes, and then publishing all of the quotes that we received from those banks, at which point ranking can do its work and provide us with the best available option. Now, this approach basically solves all of the challenges that we talked about previously with orchestration, because now we don't have any temporal coupling. That means all of those underlying services, if they're down, fine, our events are safely stored on that event bus, okay? We have introduced asynchronous communication. Who was in my talk yesterday? Okay, wow, welcome back. Well, so you basically then all already understand this. We had synchronous communication before, and now we are looking at asynchronous communication that allows us to get rid of that temporal coupling.

We also don't have a centralized component that is coupled to all of those individual services anymore, okay? That's also out of the way. And if we need to scale, well fine, then we just take the individual services that are suffering a bit, that need a little bit more juice, and we will scale those out or up, whatever is the best fit. There's also no control coupling anymore, which means that there is no individual component telling other components what to do. All of our services are autonomous and just reacting to things that are relevant to them. And finally, if we would then want to add a step, it becomes super easy, because now we introduce our invoicing service, we will just subscribe to the relevant event and this will just work. It's that simple. Now, the thing is that when you look at all of the advantages that choreography gives you, there is one thing that stands out in this list, because something is repeated, and that is the fact that there's no coupling or at least less coupling, which we'll talk about in a minute.
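A minimal sketch of the choreographed version, assuming a toy in-memory event bus. The event names and handler functions are hypothetical, and the bus dispatches synchronously in-process purely to keep the example short; a real message broker would store and deliver these events asynchronously.

```python
from collections import defaultdict


class EventBus:
    """Toy publish-subscribe bus: no component owns the workflow."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        for handler in list(self._subscribers[event_type]):
            handler(payload)


bus = EventBus()
log = []  # records which services reacted, in order


def payments(payload):
    log.append("payments")
    bus.publish("PaymentCompleted", payload)


def credit_scoring(payload):
    log.append("credit_scoring")
    bus.publish("CreditScored", {**payload, "score": 720})


def quoting(payload):
    log.append("quoting")
    bus.publish("QuotesReceived", {**payload, "quotes": {"bank_a": 3.4, "bank_b": 3.1}})


def ranking(payload):
    log.append("ranking")
    quotes = payload["quotes"]
    bus.publish("BestQuoteSelected", {**payload, "best": min(quotes, key=quotes.get)})


bus.subscribe("QuoteRequested", payments)
bus.subscribe("PaymentCompleted", credit_scoring)
bus.subscribe("CreditScored", quoting)
bus.subscribe("QuotesReceived", ranking)


# Adding a step: the new service subscribes, and no existing service changes.
def invoicing(payload):
    log.append("invoicing")


bus.subscribe("PaymentCompleted", invoicing)

bus.publish("QuoteRequested", {"request_id": "req-1", "amount": 500_000})
```

Note that the overall flow is not written down anywhere; it only emerges from the set of subscriptions.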

Now, when we consider coupling in the context of these coordination mechanisms, then you could argue that basically in an orchestrated approach we see a lot more coupling, whereas in a choreographed approach, we see a lot less coupling, and usually that's what we want to do. We don't want to remove coupling entirely, because then our system is not operable. Something that is truly decoupled does not work together, but we want to manage and balance that complexity to a way that it is manageable and understandable for us.

But there is really one big flaw in this assumption, because which type of coupling are we even referring to? Now, I have an Italian friend. He's really, really, really proud of his culture and his country, and if I would tell him that all of these spoons just contain some flour and water, and therefore are all pasta, he would get really, really pissed at me. And it's the same concept, because just like all of these different types of pasta serve different types of meals, we have different types of coupling in our systems that we need to understand, because they also introduce different types of problems and make us make also different types of trade-offs, okay?

So let's consider some different types of coupling that can actually affect us when we try to design these inter-system interactions. First of all, afferent and efferent coupling. Who's heard of that before? Okay, a couple of hands. Well, afferent and efferent coupling was introduced years ago with object-oriented programming, and basically described the coupling in between objects, those object interactions, right? But in a microservices world, we can use the same concepts to also reason about how our services are coupled and how they interact.

So afferent or incoming coupling basically refers to all of the services that call and therefore depend on you, whereas efferent coupling refers to all of the outgoing connections, all of the services that you call, and therefore you depend on. Now, if you connect this back to the coordination approaches that we've been talking about, then you could start to see that afferent coupling is a lot more present in a choreographed approach, because you're subscribing to a lot more services here, so you depend on a lot more services, whereas an orchestrated approach actually has a lot more outgoing or efferent coupling, because you have that orchestrator calling all of those other services.
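One way to make these two measures tangible is to count them from a list of dependency edges. The sketch below assumes each edge is a (dependent, dependency) pair and uses hypothetical edge lists loosely modelled on the loan broker flow; it is an illustration, not a formal metric definition.

```python
from collections import Counter


def coupling(edges):
    """Return (afferent, efferent) counts from (dependent, dependency) pairs.

    afferent[s]: how many services depend on s (incoming arrows).
    efferent[s]: how many services s depends on (outgoing arrows).
    """
    afferent = Counter(dependency for _, dependency in edges)
    efferent = Counter(dependent for dependent, _ in edges)
    return afferent, efferent


# Orchestration: the orchestrator calls, and so depends on, every service.
orchestrated = [
    ("orchestrator", service)
    for service in ("payments", "credit_scoring", "quoting", "ranking")
]

# Choreography: each subscriber depends on the publisher of the event it handles.
choreographed = [
    ("payments", "quoting"),
    ("credit_scoring", "payments"),
    ("quoting", "credit_scoring"),
    ("ranking", "quoting"),
    ("invoicing", "payments"),
]

afferent_o, efferent_o = coupling(orchestrated)
afferent_c, efferent_c = coupling(choreographed)
```

In this toy data the orchestrator carries all four outgoing dependencies, while in the choreographed graph the dependencies are spread out as incoming arrows on the publishers.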

What is really important for you to understand is that the coupling is still there. I hear this all the time that, "Oh no, we're subscribing to an event. There's no more coupling."

Yes, there is. There is still coupling, there's still an arrow. What we've basically done is we've reversed the direction of the arrow. Service A is not coupled to service B anymore, but it's now service B that is subscribed to events from service A and therefore coupled to that, okay? Now one of the arguments that comes up all the time when I have this conversation is that, "No, there is no coupling, Laila, because now we're communicating across a piece of infrastructure. There's an event bus or a message broker in the middle."

But the thing is that there is definitely still coupling. If that service would never publish the event, well, the other services would have nothing to do. There's nothing to react to. And even if you think about that event that is being published, if that would change in any shape or format, then that would affect all of the services that are subscribed to it. So there is definitely still coupling. There is that type of contract coupling, if you will, but there is I guess some truth to this argument, but in a form of a different type of coupling, because what it does remove is temporal coupling, right? Because now we are not immediately dependent on that service being available at that specific moment in time. So we achieve temporal decoupling. If the service is down, well, fine, our messages, our events are safely stored on our message broker. But this is where I see another misconception that comes up all the time, and that is that I've seen a very high association with the orchestration pattern and this assumption that therefore you're using synchronous communication, but that's not necessarily true.

In synchronous communication, we have the caller sending a request to the receiver and blocking, waiting until that has been processed, but there's absolutely nothing stopping us from using asynchronous communication in an orchestrated approach, or even from using synchronous communication in a choreographed approach. It's a bit unnatural, but it is possible. I wouldn't necessarily immediately recommend it, but there are possibilities like that, and if you're interested, I would highly recommend that you read through this book, because it goes very, very deeply into all of these options, and I will have it available in the resources as well. Now, the key point that I really want you to remember is that it's really important to differentiate the sort of communication pattern, which can be synchronous or asynchronous communication, from your coordination mechanism, which can be orchestration or choreography, okay? Those are distinct things, different types of decisions. Now, if we would use a combination where we use orchestration based on asynchronous communication, we get a whole different picture.

Now, this is exactly the same flow as I showed you all the way in the beginning. It looks a lot more noisy. There are a lot more arrows going from here to there, and that is because our communication is now asynchronous, right? There's a message going from the orchestrator to the service, and the response will basically go back in a different message. Those things are now decoupled. That communication is now happening through our message broker, and that's what gives us that temporal decoupling, all right? We also still have our command and our contract coupling, because we have the orchestrator commanding individual services what to do, and it knows the contract it needs to use to instruct that service what to do. Although again, it's not telling it how to do it, there's still a lot of autonomy in the individual service, but that orchestrator is still subject to high contention.

All of that communication is still flowing through the orchestrator. It still needs to be the single point of truth. So that's still a concern, but it's a little bit different now, because we are storing all of those messages on a queue. So that gives us a little bit more flexibility if we want to start to scale these things. And when it comes to the idea that this is still a single point of failure, well, I would argue that it's not much more different than any of the other services that participate in the workflow, because if the orchestrator would become unavailable, all of our messages or events are safely stored on the message broker, and when it's back available, it can just continue where it left off.
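The "continue where it left off" behaviour can be sketched with two in-memory queues standing in for the message broker. Everything here (message names, the `NEXT_COMMAND` table, the drain functions) is a hypothetical simplification; a real broker would also persist the messages and a real orchestrator would persist its state.

```python
from collections import deque

# Which command the orchestrator sends after each reply it receives.
NEXT_COMMAND = {
    "LoanRequested": "TakePayment",
    "PaymentTaken": "ScoreCredit",
    "CreditScored": "RequestQuotes",
    "QuotesReceived": "RankQuotes",
}
# What each (stubbed) service replies to a given command.
REPLY_FOR = {
    "TakePayment": "PaymentTaken",
    "ScoreCredit": "CreditScored",
    "RequestQuotes": "QuotesReceived",
    "RankQuotes": "QuotesRanked",
}


class AsyncOrchestrator:
    def __init__(self, commands):
        self.commands = commands  # outgoing command queue
        self.state = {}           # request_id -> last reply processed

    def handle(self, request_id, reply):
        self.state[request_id] = reply
        command = NEXT_COMMAND.get(reply)
        if command is not None:
            self.commands.append((request_id, command))


def drain_orchestrator(orchestrator, replies):
    while replies:
        orchestrator.handle(*replies.popleft())


def drain_services(commands, replies):
    while commands:
        request_id, command = commands.popleft()
        replies.append((request_id, REPLY_FOR[command]))


commands, replies = deque(), deque()
orchestrator = AsyncOrchestrator(commands)

replies.append(("req-1", "LoanRequested"))
drain_orchestrator(orchestrator, replies)  # sends TakePayment
drain_services(commands, replies)          # payments answers with a message

# The orchestrator is "down" here: the reply simply waits on the queue.
waiting_while_down = list(replies)

# Back up again: it picks up the waiting reply and the workflow continues.
while replies or commands:
    drain_orchestrator(orchestrator, replies)
    drain_services(commands, replies)
```

The request and the response travel as separate messages, which is why the diagram gains so many arrows while the workflow logic stays in one place.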

Now, another thing that I also want you to consider is the coloring of the arrows, because I swear, this was so hard to create in PowerPoint, I really suffered. I didn't do this for nothing. There's intentionality in the colors, because those blue arrows, they represent commands. The white arrows represent a response to that command. It's basically just a message in the opposite direction routed back to the originator, if you will, to the orchestrator. But there's another design decision that we need to make here, because as an individual service, when you receive and process that command from the orchestrator, you can choose to reply to the orchestrator, or you could also choose to publish an event and actually make that information known to the wider system, allowing also the orchestrator to basically subscribe to that event, right? One example could be the requirement that we had around credit scoring, because usually when a customer comes, they ask for multiple types of quotes.

One is for €500,000, one is for €550,000, and so forth. The credit scoring, we don't want to keep repeating that, because it costs us money. So at that point we could say, you know what? Let's publish an event there instead of just replying to the orchestrator, so that we can capture that information outside in another service and keep that around. Another option is around payments, right? The payment is complete. Maybe we want to publish an event there, so that we maybe have an accounting service that is now subscribing to that information as well. Or even as a last example ranking, because maybe when we've ranked all of our offers, we want to keep track of that in another component that will rank banks and tell us what are usually our best banks that provide the best possible offers to our customers, and have that insight as well.

My point is that you have to make an intentional decision, but you will also find that sometimes it can be interesting to also reply to the orchestrator and actually giving that a lot more information, especially if the orchestrator is inside your service boundary, then you can share a lot more data, but the event that you want to publish needs to be a lot more condensed, because you have absolutely zero control over who can subscribe to that public versus internal or private data, because once you basically publish all of that data, there's no way to make it private again. So that's also going to be a consideration here.
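To illustrate the difference between a rich internal reply and a condensed public event, here is a small sketch. The field names, the score thresholds and the example values are all made up for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CreditScoringReply:
    """Internal reply to the orchestrator: may carry detailed data."""
    request_id: str
    customer_id: str
    score: int
    monthly_income: int
    risk_notes: str


@dataclass(frozen=True)
class CustomerCreditScored:
    """Public event: anyone may subscribe, so it stays condensed."""
    customer_id: str
    score_band: str


def to_public_event(reply: CreditScoringReply) -> CustomerCreditScored:
    """Reduce the internal reply to what is safe to share widely."""
    if reply.score >= 700:       # thresholds are illustrative assumptions
        band = "high"
    elif reply.score >= 600:
        band = "medium"
    else:
        band = "low"
    return CustomerCreditScored(reply.customer_id, band)


reply = CreditScoringReply("req-1", "cust-42", 720, 4_800, "two late payments in 2019")
event = to_public_event(reply)
```

Keeping the two contracts as separate types makes the decision about what becomes public an explicit one.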

Now, one of the things that sometimes I discuss with people when I show them this orchestrated approach based on asynchronous communication, and they see both commands and events, they get confused and they ask me, "But wait a second. Now you're also using events. Is this still really orchestration? I mean, it seems like you're mixing things here." But the thing is that this is most definitely still an orchestrated approach. I still have a single component that is responsible for my workflow. That is the main thing by which you can recognize orchestration. That workflow is handled in a dedicated thing that is responsible for it. That state is managed there. That's the component, making decisions on what to do and when to do it, and what to do next. So it's most definitely still orchestration.

So where are we now? Well, with this asynchronous orchestration approach, we now got rid of temporal coupling, because we're now using a message broker and using asynchronous communication to communicate between the orchestrator and those individual components. We also got rid of that direct service coupling, because of that broker in the middle. There's still contract coupling, right? We are aware of the contracts that we need to call those individual services, and we still also have that control coupling. But again, the upside of control coupling is that you have a single component that owns the process flow, the workflow, what needs to happen and what comes after, and what are the prerequisites of each step, and having that in a single piece of code that you can navigate to, and read and try to understand how these things work together can be really valuable in understanding your system.

The state is also still centralized in a single place, but that can also be really valuable, because if you need to understand what the state of the workflow is, there's a single point that you can now query. You don't have to look at all of the individual services, but again, we haven't really answered that initial question of which pattern is now better? Is it orchestration or is it choreography? But I haven't even enabled us to be able to answer that question, because I have yet to talk about all of the downsides or challenges that a choreographed approach will introduce in our systems. Now to discuss that, let's consider a completely different workflow inside the banking domain. Let's say that I've considered all of these offers that the loan broker gave me, and I went with the best possible option. Now, at that point, there's a whole other flow that will kick off inside the bank's domain, and having worked on exactly this domain at a Belgian bank for almost five years, there's one or two things that I can tell you about how that would work.

So what would happen is now we're going to accept that offer, right? We're like, "Okay, we're ready. Put the sort of strand around our neck. We'll take the mortgage, thank you." At that point, the bank will again do an ID check and a credit check. Yes, they trust a loan broker, but it's always a trust but verify type of situation. Once that is done, then you're going to have to sign documents until you feel your wrist is going to fall off. Once that is done, we also, if you don't already have one, we're going to have to create an account at this bank, and the bank is also going to verify that there's actually an incoming stream of money, so that they actually have some certainty that they're going to get their money every month. Once that is done, that's when we can actually start and create that loan.

Part of doing that is also calculating interest provisions. Now, that amount of money that you pay every month, part of it is a capital repayment, and another part of it is an interest repayment, right? That interest is basically the sort of money that the bank earns from lending you that money. And what they tend to do with that, is they set aside a part of it to be able to deal with bad assets, people who are not able to pay back their loans, or even to immediately reinvest it to keep that money growing. So we immediately want to calculate what are the amounts that we can set aside, so that the process is simplified after.

Another thing we want to do is link up your repayment account, so that there's actually a standing order, so that there's money coming off your account every month. And then, finally, we can basically schedule the payout to the notary, because I don't know how it works here in Norway, but in Belgium, if you get a mortgage, there's absolutely no way that you're ever going to see even half a cent of that money in your bank account, not even for a split second.

They are way too concerned that I would take the money and run off to The Bahamas, which I might. But instead, what they do is they basically pay out all of this money to an intermediate party called the notary, a party that is overseeing the sale and the acquisition of that house, and only when the sale is complete will they transfer the money to the seller. So when you go to the notary, again, I swear at the end, you have a cramping wrist. You have to sign a bunch more documents, and we're going to scan those, and only when that has been scanned at the bank will your loan actually start running. So let's take this approach with choreography. Now in this case, we will accept the quote and that quote will be accepted. There's an event that is published for that. Then we will do our ID check and also our credit scoring check.

And only when the credit scoring is done, loans will subscribe to that event to then create the loan, and then, based on when the documents are signed, we will generate provisions. We will set up that payment to our notary and make sure that we have that repayment account also ready to go. But let's consider all of the downsides to this approach, and the first one I want to talk about is business failures. Now, I'm specifically saying business failures and not technical failures, because if you encounter any technical failures, there's a payment provider down or some other service, you need to use resilience patterns to deal with those. But sometimes we can also have business failures, which are basically sort of alternate flows that can happen when things don't really go the way we intended, the way we expected in the beginning. It's not even necessarily a failure, if you will, but there's something in your workflow that is branching off and now has to deal with things differently.

Now, let's consider that I'm all excited we're going to buy this house, then we go over to the notary and the sellers back out. Well, at that point we have a problem, because the money, well, the notary has it, so at this point, the notary will have to refund that money back to the bank. So we have an arrow going in the opposite direction. Also, those interest provisions and that repayment account, those are things we need to undo. We don't want the interest provisions anymore. The money's never going to come in anyway, so there's absolutely no need in having those. And also that repayment account, we need to unlink it, because you're going to be very pissed off if they end up taking money from your account for a house that you didn't even get. But we also may want to make our quoting service in the beginning aware that the offer, even though it was accepted, was never really fulfilled.

The loan was never really created. What is really important for you to remember from this example is when you look at those compensating flows, how many arrows are you creating in the reverse direction to be able to deal with those scenarios? Because if we would try to implement this here, now we have our loans service that needs to subscribe to that payment refunded event from the notary, right? So what happens now is that we are creating bidirectional coupling, because previously payments were subscribed to events from loans, but now we have loans subscribing to events from payments. Those services are coupled back and forth.

When it comes to undoing the provisioning and unlinking that payment account, we can just publish a loan canceled event and they can do whatever is necessary, and there's no additional bidirectional coupling we're creating there, but we have additional coupling when the quoting service now also needs to be subscribed to that loan canceled event. The thing that you want to look out for in your pictures when you draw this is how many arrows you create in the reverse direction. Because if now you have to hop five services back to be able to compensate for such a business failure, if you will, there's probably a lot more complexity than you initially intended.
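Counting reverse arrows can be done mechanically once the subscriptions are written down. The sketch below assumes each subscription is a (subscriber, publisher) pair; the service names and the two edge sets are one possible reading of the mortgage example, not an exact copy of the slides.

```python
def bidirectional_pairs(subscriptions):
    """Return the pairs of services that subscribe to each other's events."""
    edges = set(subscriptions)
    return {
        frozenset((subscriber, publisher))
        for subscriber, publisher in edges
        if subscriber != publisher and (publisher, subscriber) in edges
    }


# Happy path: arrows only point "forward" through the workflow.
happy_path = {
    ("id_check", "quoting"),         # QuoteAccepted
    ("credit_scoring", "id_check"),
    ("loans", "credit_scoring"),
    ("payments", "loans"),           # schedule the payout to the notary
    ("provisions", "loans"),
    ("banking", "loans"),            # link the repayment account
}

# Compensation when the sellers back out.
compensation = {
    ("loans", "payments"),           # PaymentRefunded: a reverse arrow
    ("provisions", "loans"),         # LoanCanceled: same direction as before
    ("banking", "loans"),            # LoanCanceled: same direction as before
    ("quoting", "loans"),            # LoanCanceled: new coupling, several hops back
}

before = bidirectional_pairs(happy_path)
after = bidirectional_pairs(happy_path | compensation)
```

Here `before` is empty and `after` contains the loans and payments pair, which is the back-and-forth coupling described above.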

Now, these compensating flows, like I said, are basically alternative flows beyond the happy path, something that you didn't immediately expect at the beginning. And usually these require some of the previous actions that were completed to be undone or changed. Now, this always reminds me of knitting, which is one of my favorite hobbies, and sometimes life happens and I drop a stitch, and that's actually fine, because usually I can pick it back up and fix it while I'm going. But sometimes my mistakes are so big that I have to unravel part of my work, but that doesn't necessarily mean I need to start from scratch, right? You don't necessarily always need to return to the initial state, but you're going to have to take some compensating actions. But what is important here is that doing that type of compensation usually requires you to interact with the same underlying services that you used to get here.
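A compact way to picture compensating actions is a list of steps, each paired with the action that undoes it, where a failure triggers the undo actions of the completed steps in reverse order. The step names below are hypothetical and the "actions" are just strings written to a log; steps with no compensation are left as they are, so the workflow does not necessarily return to its initial state.

```python
def run_with_compensation(steps, fail_at=None):
    """Run steps in order; on failure, compensate completed steps in reverse.

    steps: list of (step_name, compensation_name or None).
    fail_at: name of the step that fails, used here to simulate a business failure.
    Returns a log of everything that happened.
    """
    log, completed = [], []
    for name, compensation in steps:
        if name == fail_at:
            log.append(f"{name} failed")
            for _, undo in reversed(completed):
                if undo is not None:  # some steps need no undoing
                    log.append(undo)
            return log
        log.append(name)
        completed.append((name, compensation))
    return log


mortgage_steps = [
    ("check_id", None),
    ("create_loan", "cancel_loan"),
    ("calculate_provisions", "release_provisions"),
    ("link_repayment_account", "unlink_repayment_account"),
    ("pay_notary", "request_refund_from_notary"),
    ("complete_sale", None),
]

# The sellers back out at the notary.
log = run_with_compensation(mortgage_steps, fail_at="complete_sale")
```

Every compensation talks to a service that was already involved on the way forward, so no new dependency is introduced by the undo path.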

So if you have a lot of that bidirectional coupling, orchestration might actually be a better approach because it was already coupled to those services, and there's no additional coupling that you need to be able to implement those flows. It's also a lot easier to implement them, but there are even more downsides, like passive-aggressive publishers, but most of you were in my session yesterday, so you already know this one, but for the ones who weren't, I always like to explain this with an analogy, because my husband is the one that usually cooks at home, right? And I really, really appreciate that. So I think it's only fair for me to clean up the kitchen, but there are some of those days that I come into the kitchen and it's like, "Oh my God, really?" Every pot and pan that we had in the house, everything we had is now dirty. Really.

So at that point, I could publish an event and state that the kitchen is messy, but there's a big problem with that, because first of all, I expect my husband to be listening to me. I expect him to be subscribed to that event. And on the flip side, I also expect him to actually do something. I want him to walk into the kitchen and help me out. And that is passive-aggressive communication, not good in your relationships, and definitely not good in your systems. So whenever you find something like that, where you publish an event, but from the internal service boundaries perspective, you expect certain things to be done, to be able to have a consistent state, then something is off. So you should be asking yourself two different questions. First of all, you should ask yourself, "Okay, if I'm in such a situation and I can see that this service requires that, then maybe it should be part of that same service boundary."

Let me show this with an example. What stands out here is that we have this banking component that is now subscribed to the document signed event to link that payment account. So from the loan's perspective, that is passive-aggressive communication, because the loan service boundary requires a repayment account to be linked, because otherwise it will never be able to get back its money. So the two things to ask yourselves are: if you really need that to be done, should that not be a responsibility that is also contained inside the loan service boundary? But if that's not the case and you're like, "No, because I'm in this scenario where it shouldn't be and it's right, the service boundaries are right." Okay, fine, but then use command-driven communication. Don't use a publish-subscribe mechanism. Make your coupling and your intent clear, so that you can also deal with some of the side effects that may happen if that doesn't end up happening.
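To make the difference between publishing an event and sending a command concrete, here is a minimal in-memory sketch. The `InMemoryBus` class and the message names `DocumentSigned` and `LinkRepaymentAccount` are hypothetical and do not represent the API of NServiceBus or any other messaging library.

```python
class InMemoryBus:
    """Toy bus contrasting events (zero or more subscribers) with commands (exactly one handler)."""

    def __init__(self):
        self._subscribers = {}  # event name -> list of handlers
        self._handlers = {}     # command name -> single handler

    def subscribe(self, event, handler):
        self._subscribers.setdefault(event, []).append(handler)

    def register(self, command, handler):
        if command in self._handlers:
            raise ValueError(f"command {command!r} already has a handler")
        self._handlers[command] = handler

    def publish(self, event, payload):
        # The publisher does not know or care who listens; nobody listening is valid.
        handlers = self._subscribers.get(event, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)

    def send(self, command, payload):
        # The sender expects the work to be done, so a missing handler is an error.
        if command not in self._handlers:
            raise LookupError(f"no handler registered for command {command!r}")
        return self._handlers[command](payload)


bus = InMemoryBus()
# Publishing succeeds silently even though nobody links the repayment account.
print(bus.publish("DocumentSigned", {"loan_id": 1}))
try:
    bus.send("LinkRepaymentAccount", {"loan_id": 1})
except LookupError as error:
    print(error)  # the unmet expectation is now visible to the sender
```

With the command, the expectation that the account gets linked is explicit, so the sender has a place to deal with the case where it does not happen.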

But that's not even all, let's even consider additional challenges. Let's say that we had a regulation change. I remember this happening at some point. We had the ID check that we had to do. We had the credit scoring that we also had to complete, but at some point they said, "We also want to do a background check, because we want to avoid to loan money to criminals, basically."

So we were like, "Okay, well, easy enough, we can just introduce this background check service." And it's easy, right? As I said in the beginning, now we can just subscribe to an event, then we're fine. But in this flow, that's not really true, because our loan was created when loans were subscribed to that credit scoring event. But now we also need to wait for that background check to happen. Okay, you know what? We'll just remove that subscription and we will subscribe to the background check event instead. Be aware that now you're already touching two services again to basically even be able to implement this.

Another thing to look out for is that we've actually made things a lot worse, because now we've made these steps dependent on each other. First, the ID check, then the background check, sorry, then the credit scoring check, and then the background check. But there's absolutely no business requirement that tells us that they are dependent. The only thing that is required in the beginning is the ID check, and once we have that done, it doesn't matter whether we do the credit scoring first or the background check first. Well, you could say, "You know what? Let's just have loans subscribed to both of those events. We'll subscribe to credit scoring, but also to the background check event." And that would then work, right? Well, yes, but now loans also has to keep track of that state, because it has to understand when was the credit scoring done and when was the background check done?

It will have to store that information, so that we'll know when it has to continue. State management, handling complex prerequisites. Wait a second. Wasn't that what orchestration was for? So these are the things that you need to look out for. It can be really easy in a choreographed approach to add additional steps to the tail, just subscribe and you can just publish an individual service and we're good. But if you need to make impactful changes to the workflow, it becomes a lot harder and it can start to impact multiple services. The thing is that in a choreographed approach, we don't have that workflow in a single place, rather, that workflow and that behavior is emerging from all of those event chains. So one of the things that you could do to mitigate this type of an issue is to find areas that are prone to change by engaging in those conversations with your business stakeholders.
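The state that loans would have to keep when it subscribes to both events can be sketched as follows. This is only an illustration under assumptions: the event names and the `LoanPrerequisites` class are hypothetical, and a real system would persist this state instead of holding it in memory.

```python
from dataclasses import dataclass, field

# Hypothetical event names; the two checks may complete in either order.
REQUIRED = frozenset({"CreditScoringCompleted", "BackgroundCheckCompleted"})


@dataclass
class LoanPrerequisites:
    """Per-loan record of which prerequisite events have arrived so far."""

    loan_id: str
    seen: set = field(default_factory=set)

    def handle(self, event_name):
        """Record an event; return True only when the last missing prerequisite arrives."""
        if event_name not in REQUIRED:
            return False
        already_complete = self.seen >= REQUIRED
        self.seen.add(event_name)
        # Duplicate deliveries after completion must not create the loan twice.
        return not already_complete and self.seen >= REQUIRED


state = LoanPrerequisites("loan-1")
print(state.handle("BackgroundCheckCompleted"))  # still waiting for credit scoring
print(state.handle("CreditScoringCompleted"))    # both done, the loan can be created
```

Even this small tracker is workflow state living inside a choreographed service, which is exactly the responsibility that orchestration was meant to take on.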

We talked about this again a little bit yesterday, practice finding anti-requirements. Try to ask your business stakeholders questions. Could it be that we would ever introduce this type of a requirement or that type of a requirement? Whenever they say, "No, of course not, that would be absolutely crazy." Well, then you have a lot more security that that's going to be a stable piece. But when they say, "Oh, these things might actually change." Okay, then you need to consider them. I'm not saying then don't use choreography, but I am saying take your designs and look at what would be the necessary changes to actually implement all of those plausible scenarios. What would be the impact of all of those changes? How many services would that impact, right? And that's when you can start to see what would be the risks, because that's just a healthy part of decision-making, is to also consider the risks of the decisions that you make, even your design decisions, so that you can understand the trade-offs that you are making when you design your systems.

Everyone's still with me? I have a couple more, okay? Now, versioning, that's another tricky one, because versioning a workflow in a distributed system, that is hard independent of any coordination mechanism, whether you're using orchestration or choreography, it's always a little bit of a challenge, or a big challenge. But in a choreographed approach, it's even a lot more challenging, because like I said, there is no workflow really, right? The workflow is emerging from all of those event chains. I find this a very sort of funny term, emerging behavior, because for me it sounds like I have no idea what the system does. Let it just happen, and I'll tell you what it does. That's really what we are saying.

Now, the thing is that you need to start thinking when you introduce a new requirement, like that background service check, "Okay, how does this affect all of my in-flight workflows? Are they now also susceptible to the service check or is it only the new ones? Or how should this work, and how many services does that affect?" Because that's when it gets ugly, when versioning starts to affect multiple services, and you now have services that are doing if-not-null checks in events to see whether some data is there and whether some data is not there. And you start to see some of the concerns that were once previously kept beautifully inside one service boundary leak into your other service boundary, because of versioning issues.
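Here is a small sketch of what such a leaked versioning check can look like in a downstream handler. The field names and the rule that in-flight workflows are exempt from the new check are assumptions made for illustration; the right policy is a business decision.

```python
def can_create_loan(event):
    """Decide whether a loan may be created from a (possibly old-format) event."""
    # Events published before the background check existed lack this field.
    background_ok = event.get("background_check_passed")
    if background_ok is None:
        # Assumed policy: in-flight workflows started under the old rules are exempt.
        return event["credit_score_ok"]
    return event["credit_score_ok"] and background_ok


old_event = {"loan_id": 1, "credit_score_ok": True}
new_event = {"loan_id": 2, "credit_score_ok": True, "background_check_passed": False}
print(can_create_loan(old_event), can_create_loan(new_event))
```

The handler now carries knowledge about when another service started publishing a field, which is the kind of concern that used to sit inside a single service boundary.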

And then, there's one more challenge to discuss, and that's the lack of overview, because as I have said a couple of times, with orchestration, your workflow is in a single piece of code. You can go to it, read through it, and reason about how that behaves. State, also in a single place. I can tell you what the state of the workflow is. In choreography, now, emerging behavior, let me go figure that out for you. Give me a second. And that's why in these types of situations, observability really comes into play. You need a way to be able to truly understand and explain to peers, to your business stakeholders, to your managers what the system is doing. So you need to have an observability strategy. Happy to talk to you about that, but it's outside of the scope of this session.

Because if you don't, there are three points of friction that tend to come back. Sorry about that. That leads to basically your pinball architecture, where you're now publishing an event and it's like ding, ding, ding, ding, ding across the system. One event, another event, and another event, and you're like, "I don't know where that came from."

That's when you start to see issues, when you need to understand the state of the workflow, because first you need to understand all of the services that are even involved in the workflow and query all of their state, and then put that together and try to reason about it. It's tricky, especially when you're troubleshooting issues that are happening in production, then it becomes really, really painful, because there's someone at your door and saying, "Come on, what's up? Is the issue already fixed?"

And you're like, "I'm still just trying to figure out where we are."

And that becomes really tricky when you have multiple teams that are involved and everyone starts pointing fingers at each other, because this team is saying, "Well, we just subscribed to the event from that team."

And that team is saying, "Oh, wow, we just propagated information that we got from that team."

So everyone starts pointing fingers. And it's kind of funny, because if a manager steps in a situation like that, it's simple. If he can't find anyone that is responsible, guess who is responsible? Everybody, nobody goes home. So like I said, observability really becomes important to also be able to manage these types of things, to understand how the system is behaving. That can also be a very good reason to implement some distributed tracing, even if you're just keeping a 1% sample, so that you can get an idea of how your system interacts.
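For the 1% sample, one common idea is to derive the sampling decision from the trace ID, so every service that sees the same trace makes the same choice and sampled traces stay complete. The sketch below is a hand-rolled illustration with a hypothetical `should_sample` helper, not the API of any tracing library; real tracing libraries typically ship their own ratio-based samplers.

```python
import hashlib


def should_sample(trace_id, rate=0.01):
    """Deterministically keep roughly `rate` of all traces, based only on the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the ID to a 64-bit bucket and keep it if it falls below the threshold.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(rate * 2**64)


kept = sum(should_sample(f"trace-{i}") for i in range(10_000))
print(f"kept {kept} of 10000 traces")
```

Because the decision depends only on the trace ID, a workflow that hops across many services is either traced end to end or not at all.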

Now, who's feeling like this? Like, "Oh my God, you've just given me challenges and problems, and trade-offs, and I have no idea what to choose."

Yeah. Well, I also felt like this many, many, many times, because this is a difficult exercise, but it's difficult. It is not impossible. But the reality of the question of orchestration versus choreography is that there is no versus, right? Any well-architected, well-designed distributed system will have a little bit of both, and it becomes important to understand in which context one pattern is better suited than the other. It depends. I know we love and hate this, and that's exactly why I want you to be able to walk away with a more tangible decision framework, something that goes beyond it depends and will put you on the right track to making a decision every time. And that basically consists of five questions that I tend to use to lead me in the right direction.

The first one is to ask yourself, what type of communication is best suited in your scenario? Would synchronous communication be needed or is asynchronous communication also an option? The answer to that question will be dependent on both business requirements and technical requirements. Do you need the answer immediately or is it okay to wait a little bit for the answer, so that we can have more resilience, right? Even scaling can be a concern that affects this decision. But what is important to know is that if you choose to go with synchronous communication, you are more likely to use an orchestrated coordination mechanism with synchronous communication. Although again, it's a spectrum. It's not a binary yes or no. If you choose asynchronous communication, you can go both ways, then you'll have to look at all of the other questions to be able to make a decision.

The next question to ask is which direction of coupling makes sense here? Do you want to use command-based communication or is an event-based approach a better idea in your situation? Do you have complex prerequisites? If you're in a case where you have a lot of those complex prerequisites, like we saw with the credit scoring and the background check, those are usually the things that I look out for, where I'm like, "I need a little bit more control here, because there's more complexity on when I can do certain things." And that can tend to lead you to favor orchestration. On the other hand, if you want event-based type of communication, well, you're more likely to use a choreographed approach.

The third question to keep in mind is asking yourself, "Are there many complex compensating flows," right? Like I said, those sort of sub-workflows or side workflows that can emerge when things don't go according to plan, when things happen outside of that happy path, how much bidirectional coupling would something like that introduce in your system? Because the more complex flows you have, the more likely it is that orchestration will be a good fit for you to basically manage and balance that coupling better, to not create that bidirectional coupling all over your system. But if you're saying, "Oh, I don't have that many compensation flows," or, "They're not even that complex, they don't introduce a lot of bidirectional coupling or even not at all." Fine, then you are still very good to go with a more choreographed approach.

Another question to ask yourself is, "Do you have a high probability of change," right? Are you working on a relatively stable domain? And how confident are you that you understand those business requirements well enough? How do those change scenarios look? Do you have access to business stakeholders to ask them questions? Because sometimes it's not that we just make bad decisions, sometimes we make the best decision that we can based on the information we had available, because we didn't even get the chance to talk to our business stakeholders. To who has that happened already? Yeah, many hands, it's tough. But if you're in a situation like that, you probably need to be a lot more flexible. You will need to be able to adapt to change. And doing that in a choreographed approach can be very painful. So again, lots of probability of change, orchestration probably a bit better, less probability of change, stable domain, stable requirements, then it might make sense, and I'm saying might make sense, to choreograph.

And finally, the other question to ask yourself is who is responsible for this entire workflow? Is this even a business-critical workflow? Because if it's not, then it maybe doesn't matter as much, but if it is business-critical, okay, who would you walk to if something would go wrong there? Which team would be responsible, which set of individuals? This is where we can start to ask those questions about ownership, about responsibility and about accountability as well. If you have a high sense of responsibility, orchestration gives you that high visibility, right? It gives you that workflow centralized in a single component that you can give to a single team to own, that you can walk to. But on the other hand, if it's not that important, well then a choreographed approach gives you a lot more flexibility.

So these are the five questions that I tend to continuously use and keep in the back of my mind when I need to make a decision in a complex workflow about which coordination mechanism I should use in a specific scenario. Now, there's one more thing to consider. Again, if you didn't get to take a picture, I have this available in the resources at the end. It's important in this exercise to always guard the scope of your workflows, okay? Because the bigger your workflows are, the more scope you have in there, the more things you are trying to deal with, the more likely you're going to end up with orchestration, because you'll need more control, you'll have more compensating actions, and you'll have a lot more of that complexity that you want to be able to visualize and understand. So this is super, super important, is to continuously look with a critical eye and ask yourself, "Which are the parts that I can isolate from this workflow and take on in a dedicated workflow?"

If you remember earlier, at some point, we also had to create that account. That's not part of the workflow. That's a completely different workflow. This is a simplistic example, but those are the things that you want to look out for. What is independent here? And another thing that I also always do is draw. Pencil and paper, people, back to basics. Or I also have a reMarkable that I tend to use for this. It's still digital, but it allows me to also erase and draw, and it's super flexible. Yeah, I see someone. Awesome, right? Yeah. So what I tend to do is I will draw both coordination styles and see how my services are interacting. Look at the arrows, because you don't tend to see that in code, especially with the publish subscribe mechanism. It's a lot more invisible. You don't have any service calls anywhere, right?

You're not even sending a message somewhere. So having that drawn, you can start to see the arrows. How many arrows do I have? How many arrows do I have in a bidirectional way? And visualizing that is a very, very good tool to then take with your peers, have discussions with your team, and start to reason about these things. Even with your business stakeholders, you can take those drawings and try to uncover maybe some hidden requirements, maybe some things that they say, "Oh, we might want to introduce this in the future," or, "We may want to change that," or, "That is never going to change."

Uncover false assumptions. Sometimes you believe that things work a certain way, but by having those conversations with your business stakeholders, you can uncover certain things that may have been just false assumptions. And it's really by going through that trade-off analysis that you get to the right choice, especially if you're like, "No, I'm sure orchestration is a good idea." You have to be your own devil's advocate and question yourself, "Why do I think that this is the best idea? And let me try to do the opposite and come up with all of the good reasons that a choreographed approach could help me forward here."

And the inverse, it's important to continuously question that, so we fight those underlying biases that are sometimes driving us. Now, if you've already sort of read about this topic before, talked about this topic before, you might be asking yourself, "Laila, why are you making it so complex with me, with all of these five questions and all of these things that I now have to do? Have you never heard of the rule of thumb, to orchestrate within a service boundary and choreograph across service boundaries?"

Who's heard that before? Okay, well, generally speaking, I agree with this advice. This is a good guideline. But the reason I don't like it, is because there's one big flaw, one big assumption hidden away in this advice. It assumes that you got your service boundaries right, which tends to be the problem to begin with, almost everywhere I've ever worked in almost two decades in my career, right? Now, I'm not saying that you should therefore never care about your service boundaries anymore. It's quite the opposite. But what I'm basically saying is, go through all of those questions. Ask yourself all of these things, see where you end up, and then use that as a tool to actually reevaluate whether you got your service boundaries right. Because if you end up with a situation and it's telling you to orchestrate, and you're looking, "Wow, I need to orchestrate across three service boundaries." Well, that's a moment to ask yourself, "Did I get something wrong here? Why is it telling me this?" Because usually it tends to point out that something is flawed in there.

It's important that, although these rules of thumb can be helpful and they could be helpful guidelines, that we always make the right decision according to our context, according to our current restrictions, our understanding of our business requirements. Even if we figure out that we got our service boundaries wrong, that's not an easy thing to change. It really requires some planning. Now we have to coordinate across multiple teams, and we're going to have to plan how we change this. And that takes time. And then, sometimes it might make sense to go against the guideline for some period of time to give you that flexibility to address those underlying issues, instead of choosing something that then is going to make your life incredibly difficult, all right?

But even when you take your time and you go through all this thorough decision-making, and all of that trade-off analysis, and you make well-balanced decisions, there's still no crystal ball. And sometimes we are still confronted with change, and that's okay. It's something that we have to embrace instead of resist. Because the thing is, the longer that we fight it, the more it's going to make it difficult for us. And I've actually seen many applications therefore sort of fall off track, where we started off good, but where there are people. And I used to be part of that group where I thought, "But I designed it this way and it was so beautiful, and I don't want to change it." But we can't really be attached like that. Code is not our baby. We have to always also understand that when you realize that your decision from a year ago was not the best decision, that is not a failure, okay?

It's really important to take that away from this session, because the thing is that that was a year ago. That was probably the best decision that you were able to make at the time with the context that you had, with the knowledge that you had, and with the business requirements that you had available. Now your insights have progressed. You better understand the system, you better understand the requirements. And now with those progressive insights, you come to a different conclusion. Fine, accept it. You have found a better way. Look at it as a win, not a loss.

And that brings me to the end. Now, there's so many trade-offs I wasn't even able to put into this session without making it even more overwhelming, but I tried to put in the most sort of common scenarios. If you want to hear more, then definitely come find me, but I hope, if anything, I've convinced you that there is no right or wrong pattern. Both choreography and orchestration should be part of a well-designed, well-architected distributed system, okay? What is important is that you continuously try to identify sub-workflows that you are able to isolate and make independent decisions for. And then, you can use that five-question framework that I presented to you today to make well-balanced decisions on which coordination mechanism is more likely to be a good fit for you in your application. When you have an answer to that question, and that doesn't really fit with the rule of thumb that we talked about, reevaluate your service boundaries.

See where you have possibly made, not a mistake, but where your insights have progressed and now you've found a better way to basically split things apart and react also accordingly, implement those changes. It's important that you're able to identify the trade-offs that you are making, because there are no perfect choices. It's just what are the downsides that you are willing to live with based on the gains that you have. And finally, also, like I said, reevaluate your decisions when your requirements change, or even when your understanding of those requirements changes and progresses. All right, thank you very much for listening. Again, I have a QR code available for you to scan if you want to do some additional reading, see some additional resources. And if you have any questions, I'm still around at Particular Software's booth until, let's say 2:30 or so, and then I'm back off to Belgium. And if you haven't grabbed one already, we have books available about the fallacies of distributed computing. It's a really, really useful read, so definitely grab one before you go. And thank you for joining me.
